We propose a novel method for translation selection in statistical machine translation, in which 
a convolutional neural network is employed to judge the similarity between a phrase pair in two 
languages. The specifically designed convolutional architecture encodes not only the semantic 
similarity of the translation pair, but also the context containing the phrase in the source 
language. Therefore, our approach is able to capture context-dependent semantic similarities 
of translation pairs. We adopt a curriculum learning strategy to train the model: we classify the 
training examples into easy, medium, and difficult categories, and gradually build the ability to 
represent phrase- and sentence-level context by training on examples ordered from easy to difficult. 
Experimental results show that our approach significantly outperforms the baseline system by up 
to 1.4 BLEU points. 

1. Introduction 


In a conventional statistical machine translation (SMT) system, the translation model is 
constructed in two steps (Koehn et al. 2003). First, bilingual phrase pairs consistent with the word 
alignments are extracted from a word-aligned parallel corpus. Second, the phrase pairs are 
assigned scores calculated from their relative frequencies in the same corpus. However, 
finding and using translation pairs based only on their surface forms is not sufficient: the 
conventional approach often fails to capture translation pairs that are grammatically and 
semantically similar. 

To alleviate the above problems, several researchers have proposed learning and utilizing 
semantically similar translation pairs in a continuous space (Gao et al. 2014; Zhang et al. 2014; 
Cho et al. 2014b). The core idea is that the two phrases in a translation pair should share the same 
semantic meaning and have similar (close) feature vectors in the continuous space. A matching 
score is computed by measuring the distance between the feature vectors of the phrases, and is 
incorporated into the SMT system as an additional feature. 

The above methods, however, neglect local context information, which has been 
proven useful for disambiguating translation candidates during decoding (He et al. 2008; 
Marton and Resnik 2008). The matching score of a translation pair is the same even when 
the pair appears in different contexts. Accordingly, these methods fail to adapt to local contexts, 
which leads to precision issues for specific sentences in different contexts. 

To capture useful context information, we propose a convolutional neural network architecture 
to measure context-dependent semantic similarities between phrase pairs in two languages. 
For each phrase pair, we use the sentence containing the phrase in the source language as the context. 
With the convolutional neural network, we summarize the information of a phrase pair and its 
context, and further compute the pair's matching score with a multi-layer perceptron. 

We discriminatively train the model using a curriculum learning strategy. We classify the 
training examples (i.e. triples (source phrase with its context, positive candidate, negative candidate)) 
according to the difficulty of distinguishing the positive candidate (i.e. the correct translation 
of the source phrase in the specific context) from the negative candidate (i.e. a bad translation 
in this context). We then train the model to learn the semantic information from easy (basic 
semantic similarities between phrase pairs) to difficult (context-dependent semantic similarities). 

Experimental results on a large-scale translation task show that the context-dependent 
convolutional matching (CDCM) model improves performance by up to 1.4 BLEU points 
over a strong phrase-based SMT system. Moreover, the CDCM model significantly outperforms 
its context-independent counterpart, indicating that incorporating local contexts into SMT 
is necessary. 

Contributions. Our key contributions include: 


• we introduce a novel CDCM model to capture context-dependent semantic 
similarities between phrase pairs (Section 3); 

• we develop a novel learning algorithm to train the CDCM model using a 
curriculum learning strategy (Section 4). 


2. Related Work 


Our research builds on previous work in the field of context-dependent rule matching and 
bilingual phrase representations. 

There is a line of work that employs local contexts over discrete representations of words 
or phrases. For example, He et al. (2008), Liu et al. (2008) and Marton and Resnik (2008) 
employed within-sentence contexts that consist of discrete words to guide rule matching. However, 
these discrete context features usually suffer from data sparseness. In addition, these 
models treat each word as a distinct feature and therefore cannot leverage the semantic similarity 
between words as our model does. Wu et al. (2014) exploited discrete contextual features in the 
source sentence (e.g. words and part-of-speech tags) to learn better bilingual word embeddings 
for SMT. However, they focused only on frequent phrase pairs and induced phrasal similarities 
by simply summing the matching scores of all the contained words. In this study, we take 
into account all the phrase pairs and directly compute phrasal similarities with convolutional 
representations of the local contexts, integrating the strengths of convolutional 
neural networks (Collobert and Weston 2008). 

Another line of work focuses on capturing document-level contexts via distributed 
representations. For instance, Xiao et al. (2012) and Cui et al. (2014) incorporated document-level 
topic information to select more semantically matched rules. Although many sentences 
share the same topic as the document in which they occur, many others 
have topics different from those of their documents (Xiong and Zhang 2013). While such 
general contexts over the whole document may not be precise enough for specific sentences 
whose contexts differ from the document, our approach is capable of learning representations 
for individual sentences. Moreover, they learned distributed representations for documents 
rather than phrases and derived distributed phrase representations from the corresponding 


Zhaopeng Tu 


Context-Dependent Translation Selection Using Convolutional Neural Network 


[Figure 1 diagram: matching model (MLP, matching score) above the convolutional sentence model (two representations).] 

Figure 1 

Architecture of the CDCM model. The convolutional sentence model (bottom) summarizes the meaning of 
the tagged sentence and target phrase, and the matching model (top) compares the representations using a 
multi-layer perceptron. The symbol indicates the all-zero padding turned off by the gating function. 


documents, while we attempt to build and train a single, large neural network that reads phrase 
pairs with contexts and outputs the match degrees directly. 

In recent years, there has also been growing interest in bilingual phrase representations 
that group phrases with similar meanings across languages. Since translation 
equivalents share the same semantic meaning, they can supervise each other to learn their 
semantic phrase embeddings in a continuous space. For example, Gao et al. (2014) projected 
phrases from both the source and target sides into a common, language-independent continuous 
space. Although Zhang et al. (2014) did not force the phrase embeddings of both 
sides into the same continuous space, they exploited a transformation between the two 
semantic embedding spaces. However, these models focused on capturing semantic similarities 
between phrase pairs in global contexts and neglected local contexts, thus ignoring 
useful discriminative information. In contrast, we integrate the local contexts into our 
convolutional matching architecture to obtain context-dependent semantic similarities. 

Meng et al. (2015) and Zhang (2015) independently proposed summarizing source 
sentences with convolutional neural networks. However, both extend the neural network 
joint model (NNJM) of Devlin et al. (2014) to include the whole source sentence, while we focus 
on capturing context-dependent semantic similarities of translation pairs. 


3. Context-Dependent Convolutional Matching Model 


The model architecture, shown in Figure 1, is a variant of the convolutional architecture of Hu et 
al. (2014). It consists of two components: 

• a convolutional sentence model that summarizes the meaning of the source sentence 
and the target phrase; 

• a matching model that compares the two representations with a multi-layer 
perceptron (Bengio 2009). 


Let e be a target phrase and f be the source sentence that contains the source phrase aligning 
to e. We first project f and e into feature vectors x and y via the convolutional sentence model, 
and then compute the matching score s(x, y) by the matching model. Finally, the score is 
introduced into a conventional SMT system as an additional feature. 
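
As a rough illustration of how a matching score can enter a phrase-based system as one more feature, the sketch below scores a hypothesis with a weighted sum of feature values, as in a standard log-linear model. The feature names, weights and values are hypothetical; the text only states that the score is added as an additional feature.

```python
from typing import Dict


def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values, as in a standard log-linear SMT model."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical feature values for one translation hypothesis.
features = {"tm": -2.0, "lm": -5.0, "word_penalty": -3.0}
# Hypothetical feature weights (in practice these are tuned, not hand-set).
weights = {"tm": 0.3, "lm": 0.5, "word_penalty": 0.1, "cdcm": 0.2}

baseline = loglinear_score(features, weights)
# Adding the matching score s(x, y) as one more feature of the same hypothesis.
with_cdcm = loglinear_score({**features, "cdcm": 1.5}, weights)
```

Under these made-up numbers, the additional feature shifts the hypothesis score by its weight times the matching score, so a well-matched phrase pair is preferred during decoding.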

Convolutional sentence model. As shown in Figure 1, the model takes as input the embeddings 
of the words (trained beforehand elsewhere) in f and e. It then iteratively summarizes the meaning 
of the input through layers of convolution and pooling, until reaching a fixed-length vector 
representation in the final layer. 

In Layer-1, the convolution layer takes sliding windows on f and e respectively, and models 
all the possible compositions of neighbouring words. The convolution involves a filter to produce 
a new feature for each possible composition. Given a $k$-sized sliding window $i$ on f or e, for 
example, the $j$th convolution unit of the composition of the words is generated by: 

$$c_i^{(1,j)} = g(\hat{c}_i^{(0)}) \cdot \phi(w^{(1,j)} \cdot \hat{c}_i^{(0)} + b^{(1,j)}) \tag{1}$$

where 

• $g(\cdot)$ is the gate function that determines whether to activate $\phi(\cdot)$; 

• $\phi(\cdot)$ is a non-linear activation function. In this work, we use ReLU (Dahl et al. 2013) 
as the activation function; 

• $w^{(1,j)}$ is the parameters for the $j$th convolution unit on Layer-1, with matrix 
$W^{(1)} = [w^{(1,1)}, \ldots, w^{(1,J)}]$; 

• $\hat{c}_i^{(0)}$ is a vector constructed by concatenating the word vectors in the $k$-sized sliding 
window $i$; 

• $b^{(1,j)}$ is a bias term, with vector $b^{(1)} = [b^{(1,1)}, \ldots, b^{(1,J)}]$. 


To distinguish the phrase pair from its context, we use one additional dimension in word 
embeddings: 1 for words in the phrase pair and 0 for the others. After transforming words to their 
tagged embeddings, the convolutional sentence model takes multiple choices of composition 
using sliding windows in the convolution layer. Note that sliding windows are allowed to cross 
the boundary of the source phrase to exploit both phrasal and contextual information. 
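
The tagging step can be sketched as follows. This is a minimal illustration with made-up 2-dimensional embeddings; the function name `tag_embeddings` and the half-open span convention are assumptions introduced here.

```python
from typing import List, Sequence, Tuple


def tag_embeddings(
    embeddings: Sequence[Sequence[float]], phrase_span: Tuple[int, int]
) -> List[List[float]]:
    """Append a 1/0 tag to each word vector: 1 inside the phrase, 0 in the context.

    phrase_span is a half-open interval [start, end) of word positions.
    """
    start, end = phrase_span
    return [
        list(vec) + [1.0 if start <= pos < end else 0.0]
        for pos, vec in enumerate(embeddings)
    ]


# Toy 2-dimensional embeddings for a 4-word source sentence (values are made up).
sentence = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
# Words 1 and 2 form the source phrase; words 0 and 3 are context.
tagged = tag_embeddings(sentence, phrase_span=(1, 3))
```

Every word of the target phrase belongs to the phrase pair, so on the target side the span would cover the whole input.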

To handle the variable lengths of source sentences and target phrases, we add all-zero 
paddings at the end of each source sentence and target phrase up to their maximum lengths. 
Moreover, we use the gate function $g(\cdot)$ to eliminate the effect of the all-zero padding by setting 
the output vector to all zeros if the input is all zeros. 

In Layer-2, we apply a local max-pooling in non-overlapping 1×2 windows for every 
convolution unit: 

$$c_i^{(2,j)} = \max\{c_{2i}^{(1,j)}, c_{2i+1}^{(1,j)}\} \tag{2}$$

In Layer-3, we perform convolution on the output from Layer-2: 

$$c_i^{(3,j)} = g(\hat{c}_i^{(2)}) \cdot \phi(w^{(3,j)} \cdot \hat{c}_i^{(2)} + b^{(3,j)}) \tag{3}$$

After more convolution and max-pooling operations, we obtain two feature vectors for the source 
sentence and the target phrase, respectively. 
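
The gated convolution and the 1×2 max-pooling described above can be sketched in plain Python. This is a toy illustration, not the authors' implementation: the filters, biases and 1-dimensional inputs are made up, and the window size is set to 2 only to keep the example small.

```python
from typing import List

Vector = List[float]


def relu(v: float) -> float:
    return max(0.0, v)


def gate(window: Vector) -> float:
    """Return 0 if the window is all zeros (pure padding), otherwise 1."""
    return 0.0 if all(v == 0.0 for v in window) else 1.0


def conv_layer(inputs: List[Vector], filters: List[Vector], biases: Vector, k: int) -> List[Vector]:
    """Gated convolution: one output vector (one value per unit) per k-sized window."""
    outputs = []
    for i in range(len(inputs) - k + 1):
        # Concatenate the k vectors covered by window i.
        window = [v for vec in inputs[i:i + k] for v in vec]
        g = gate(window)
        outputs.append([
            g * relu(sum(w * v for w, v in zip(filt, window)) + b)
            for filt, b in zip(filters, biases)
        ])
    return outputs


def max_pool(units: List[Vector]) -> List[Vector]:
    """Max-pooling over non-overlapping pairs of adjacent positions."""
    return [
        [max(a, b) for a, b in zip(units[i], units[i + 1])]
        for i in range(0, len(units) - 1, 2)
    ]


# Two real 1-dimensional "word vectors" followed by three all-zero paddings.
inputs = [[1.0], [2.0], [0.0], [0.0], [0.0]]
filters = [[1.0, 1.0], [1.0, -1.0]]  # two hypothetical convolution units, k = 2
biases = [0.5, 0.5]
layer1 = conv_layer(inputs, filters, biases, k=2)
layer2 = max_pool(layer1)
```

Without the gate, a window made only of padding would still output `relu(bias) = 0.5` in this toy setting; the gate forces it to zero so that padding does not leak into the pooled representation.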

Matching model. The matching score of a source sentence and a target phrase can be measured 
as the similarity between their feature vectors. Specifically, we use a multi-layer perceptron 
(MLP), a nonlinear similarity function, to compute their matching score. First, we use one 
layer to combine the two feature vectors into a hidden state $h_c$: 

$$h_c = \phi(w_c \cdot [x : y] + b_c) \tag{4}$$

where $[x : y]$ denotes the concatenation of the two feature vectors. Then we get the matching score from the MLP: 

$$s(x, y) = \mathrm{MLP}(h_c) \tag{5}$$
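
A minimal sketch of the matching model follows. The text does not specify the MLP beyond the combination layer, so a single linear output unit is assumed here; all weights and vectors are made-up values.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def relu_vec(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def affine(w: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute w . x + b for a row-major weight matrix."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def matching_score(x: Vector, y: Vector, w_c: Matrix, b_c: Vector,
                   w_out: Matrix, b_out: Vector) -> float:
    # Combination layer: hidden state from the concatenation [x : y].
    h_c = relu_vec(affine(w_c, x + y, b_c))
    # Assumed MLP head: one linear output unit producing the score.
    return affine(w_out, h_c, b_out)[0]


x = [1.0, 0.0]          # toy source-side feature vector
y = [0.0, 2.0]          # toy target-side feature vector
w_c = [[1.0, 0.0, 0.0, 1.0], [-1.0, 0.0, 0.0, 0.0]]
b_c = [0.0, 0.0]
w_out = [[0.5, 1.0]]
b_out = [0.1]
score = matching_score(x, y, w_c, b_c, w_out, b_out)
```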


4. Training 


Ideally, the trained CDCM model is expected to assign a higher matching score to a positive 
example (a source phrase in a specific context f and its correct translation $e^+$), and a lower score 
to a negative example (the source phrase and a bad translation $e^-$ in the specific context). To this 
end, we employ a discriminative training strategy with a max-margin objective. 

Suppose we are given the following triples $(x, y^+, y^-)$ from the oracle, where $x$, $y^+$, $y^-$ are 
the feature vectors for f, $e^+$, $e^-$ respectively. We use the ranking-based loss as the objective: 

$$L_\Theta(x, y^+, y^-) = \max(0, 1 + s(x, y^-) - s(x, y^+)) \tag{6}$$

where $s(x, y)$ is the matching score function defined in Eq. 5, and $\Theta$ consists of the parameters of 
both the convolutional sentence model and the MLP. The model is trained by minimizing the above 
objective, which encourages it to assign higher matching scores to positive examples and 
lower scores to negative examples. We use stochastic gradient descent (SGD) to optimize 
the model parameters $\Theta$. 
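
The ranking loss can be written in a few lines. The sketch below only evaluates the loss for given scores; the score values are made up for illustration.

```python
def ranking_loss(score_pos: float, score_neg: float, margin: float = 1.0) -> float:
    """Max-margin ranking loss for one (x, y+, y-) triple."""
    return max(0.0, margin + score_neg - score_pos)


# The positive candidate beats the negative one by more than the margin: no loss.
separated = ranking_loss(score_pos=2.5, score_neg=0.5)
# The positive candidate is ahead, but by less than the margin: positive loss.
too_close = ranking_loss(score_pos=0.6, score_neg=0.4)
# The negative candidate scores higher: the loss exceeds the margin.
violated = ranking_loss(score_pos=0.0, score_neg=1.0)
```

The loss is zero only once the positive candidate outscores the negative one by at least the margin, so training keeps pushing the two scores apart until that gap is reached.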

Note that the CDCM model aims at capturing contextual representations that can distinguish 
good translation candidates from bad ones in various contexts. To this end, we propose a two-step 
approach. First, we initialize the model with context-dependent bilingual word embeddings 
to start with strong contextual and semantic equivalence at the word level (Section 4.1). Second, 
we train the CDCM model with a curriculum strategy to learn the context-dependent semantic 
similarity at the phrase level from easy (basic semantic similarities between the source and target 
phrase pair) to difficult (context-dependent semantic similarities for the same source phrase in 
varying contexts) (Section 4.2). 

4.1 Initialization by Context-Dependent Bilingual Word Embeddings 


Model initialization plays a critical role in a non-convex problem. The CDCM 
model is initialized with embeddings of the words of both languages, i.e. real-valued, dense 
representations of words. Typical word embeddings are trained on monolingual data (Mikolov et al. 2013) 
and thus fail to capture useful semantic relationships across languages. It has been shown that 
bilingual word embeddings represent a substantial step towards better capturing semantic equivalence 
at the word level (Zou et al. 2013; Wu et al. 2014), and could thus initialize our model with strong 
semantic information. Bilingual word embeddings are semantic embeddings associated 
across two languages, so that similar units in each language and across languages have similar 
representations. Zou et al. (2013) utilized MT word alignments to encourage pairs of frequently 
aligned words to have similar word embeddings, while Wu et al. (2014) improved bilingual word 
embeddings with discrete contextual information. 

Figure 2 

Architecture of the CDCM bilingual word embedding model. 

Inspired by the above studies, we propose a context-dependent bilingual word embedding 
model that exploits both the word alignments and contextual information, as shown in Figure 2. 
Given an aligned word pair $(f_i, e_j)$, the context is extracted from the nearby window on each 
side (the left two words and the right two words in this work). Let $\hat{f}_i = f_{i-2}, f_{i-1}, f_i, f_{i+1}, f_{i+2}$ 
and $\hat{e}_j = e_{j-2}, e_{j-1}, e_j, e_{j+1}, e_{j+2}$ be the contextual sequences for the above word pair. We get 
their vectorial representations by: 

$$x_{\hat{f}_i} = \phi(w_f \cdot \mathrm{Le}(\hat{f}_i) + b_f) \tag{7}$$

$$y_{\hat{e}_j} = \phi(w_e \cdot \mathrm{Le}(\hat{e}_j) + b_e) \tag{8}$$

where $\mathrm{Le}(\cdot)$ converts word sequences into embeddings and returns a vector by concatenating the 
embeddings. 

Similarly, we calculate the matching score for $x_{\hat{f}_i}$ and $y_{\hat{e}_j}$ according to Eq. 4 and Eq. 5. The 
bilingual word embedding model is also trained by minimizing the objective in Eq. 6. The 
negative examples are constructed by replacing either $f_i$ or $e_j$ with words randomly chosen 
from the corresponding vocabulary. 
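
The window extraction and the negative sampling for the word embedding model might look as follows. Two details are assumptions made for this sketch, since the text does not cover them: positions outside the sentence are filled with a hypothetical `<pad>` token, and the randomly chosen replacement is required to differ from the original word.

```python
import random
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding token for positions outside the sentence


def context_window(words: List[str], i: int, size: int = 2) -> List[str]:
    """Return words[i-size .. i+size], padding positions outside the sentence."""
    return [
        words[p] if 0 <= p < len(words) else PAD
        for p in range(i - size, i + size + 1)
    ]


def corrupt(pair: Tuple[str, str], src_vocab: List[str], tgt_vocab: List[str],
            rng: random.Random) -> Tuple[str, str]:
    """Build a negative example by replacing one side of an aligned word pair."""
    f, e = pair
    if rng.random() < 0.5:
        # Replace the source word with a different random source word.
        return rng.choice([w for w in src_vocab if w != f]), e
    # Otherwise replace the target word with a different random target word.
    return f, rng.choice([w for w in tgt_vocab if w != e])


words = ["a", "b", "c", "d"]
window = context_window(words, 2)  # ['a', 'b', 'c', 'd', '<pad>']
negative = corrupt(("f1", "e1"), ["f1", "f2"], ["e1", "e2"], random.Random(0))
```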


4.2 Curriculum Training 


Curriculum learning, first proposed by Bengio et al. (2009) in machine learning, refers to a 
sequence of training strategies that start small, learn easier aspects of the task, and then gradually 
increase the difficulty level. It has been shown that curriculum learning can benefit non-convex 
training by giving rise to improved generalization and faster convergence. The key point 
is that the training examples are not presented randomly but organized in a meaningful order 
that illustrates gradually more concepts, and gradually more complex ones. 

For each positive example $(f, e^+)$, we have three types of negative examples according to 
the difficulty of distinguishing the positive example from them: 

• Easy: target phrases randomly chosen from the phrase table; 

• Medium: target phrases extracted from the aligned target sentence for other 
non-overlapping source phrases in the source sentence; 

• Difficult: target phrases extracted from other candidates for the same source 
phrase. 

Algorithm 1 Curriculum training algorithm. Here T denotes the training examples, W the initial 
word embeddings, η the learning rate in SGD, n the pre-defined number, and t the number of 
training examples. 

 1: procedure CURRICULUM-TRAINING(T, W) 
 2:     N1 ← easy_negative(T) 
 3:     N2 ← medium_negative(T) 
 4:     N3 ← difficult_negative(T) 
 5:     T ← N1 
 6:     CURRICULUM(T, n · t)                     ▷ CUR. easy 
 7:     T ← MIX([N1, N2]) 
 8:     CURRICULUM(T, n · t)                     ▷ CUR. medium 
 9:     for step ← 1 ... n do 
10:         T ← MIX([N1, N2, N3], step) 
11:         CURRICULUM(T, t)                     ▷ CUR. difficult 
12: procedure CURRICULUM(T, K) 
13:     iterate until reaching a local minimum or K iterations 
14:         calculate L_Θ for a random instance in T 
15:         Θ = Θ − η · ∂L_Θ/∂Θ                  ▷ update parameters 
16:         W = W − η · ∂L_Θ/∂W                  ▷ update embeddings 
17: procedure MIX(N, s = 0) 
18:     len ← length of N 
19:     if len < 3 then 
20:         T ← sampling with [0.5, 0.5] from N 
21:     else 
22:         T ← sampling with […] from N 


We want the CDCM model to learn the following semantic information from easy to difficult: 

• the basic semantic similarity between the source sentence and target phrase from 
the easy negative examples; 

• the general semantic equivalence between the source and target phrase pair from 
the medium negative examples; 

• the context-dependent semantic similarities for the same source phrase in varying 
contexts from the difficult negative examples. 
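
The three kinds of negative examples can be sketched as three sampling functions that differ only in the pool they draw from. The function names, data layout and toy phrases below are illustrative assumptions, not the authors' code.

```python
import random
from typing import Dict, List


def easy_negative(phrase_table_targets: List[str], positive: str, rng: random.Random) -> str:
    """Easy: a target phrase drawn at random from the whole phrase table."""
    return rng.choice([t for t in phrase_table_targets if t != positive])


def medium_negative(other_targets_in_sentence: List[str], positive: str,
                    rng: random.Random) -> str:
    """Medium: a target phrase aligned to another, non-overlapping source phrase
    of the same sentence pair."""
    return rng.choice([t for t in other_targets_in_sentence if t != positive])


def difficult_negative(candidates: Dict[str, List[str]], source: str, positive: str,
                       rng: random.Random) -> str:
    """Difficult: another translation candidate of the same source phrase."""
    return rng.choice([t for t in candidates[source] if t != positive])


rng = random.Random(0)
# Hypothetical candidate table for one source phrase (placeholder key "src").
candidates = {"src": ["wrong", "a mistake", "faulty"]}
hard = difficult_negative(candidates, "src", positive="wrong", rng=rng)
easy = easy_negative(["wrong", "focus is", "the house"], positive="wrong", rng=rng)
```

The pools shrink from the whole phrase table to the candidates of a single source phrase, which is what makes the negatives progressively harder to tell apart from the positive translation.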

Algorithm 1 shows the curriculum training algorithm for the CDCM model. We use different 
portions of the overall training instances for different curriculums (lines 2-11). For example, we 
only use the training instances that consist of positive examples and easy negative examples in 
the easy curriculum (lines 5-6). For the later curriculums, we gradually increase the difficulty 
level of the training instances (lines 7-11). 

For each curriculum (lines 12-16), we compute the gradient of the loss objective $L_\Theta$ and 
learn $\Theta$ using the SGD algorithm. Note that we also update the word embeddings during training to 
better capture the semantic equivalence across languages. If the loss function 
$L_\Theta$ reaches a local minimum or the number of iterations reaches the pre-defined limit, we terminate this 
curriculum. 
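
The overall schedule can be sketched as below. Several points are assumptions of this sketch rather than details from the text: the three-pool sampling weights (the difficult pool is weighted by the step number), the size of each mixed set (`t`), and the omission of the early stop at a local minimum. The `sgd_step` callback stands in for one parameter and embedding update.

```python
import random
from typing import Callable, List, Sequence

Example = tuple  # (source with context, positive target, negative target)


def mix(pools: Sequence[List[Example]], step: int, size: int,
        rng: random.Random) -> List[Example]:
    """Sample a training set from several negative-example pools.

    Two pools are sampled with weights [0.5, 0.5]. For three pools, the weights
    used here (difficult pool weighted by `step`) are an illustrative assumption.
    """
    weights = [0.5, 0.5] if len(pools) < 3 else [1.0, 1.0, float(step)]
    chosen = rng.choices(range(len(pools)), weights=weights, k=size)
    return [rng.choice(pools[p]) for p in chosen]


def curriculum_training(easy: List[Example], medium: List[Example],
                        difficult: List[Example],
                        sgd_step: Callable[[Example], None],
                        n: int, t: int, rng: random.Random) -> None:
    def run(examples: List[Example], iterations: int) -> None:
        for _ in range(iterations):
            sgd_step(rng.choice(examples))  # one update on a random instance

    run(easy, n * t)                                  # easy curriculum
    run(mix([easy, medium], 0, t, rng), n * t)        # medium curriculum
    for step in range(1, n + 1):                      # difficult curriculum
        run(mix([easy, medium, difficult], step, t, rng), t)


calls: List[Example] = []
curriculum_training([("e",)], [("m",)], [("d",)], calls.append,
                    n=2, t=3, rng=random.Random(0))
```

Each of the three curriculums performs `n * t` updates in this sketch, so the toy run above records `3 * n * t` calls in total.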


5. Experiments 

In this section, we try to answer two questions: 


1. Does the proposed approach achieve higher translation quality than the baseline 
system? Does the approach outperform its context-independent counterpart? 

2. Does model initialization by bilingual word embeddings outperform its 
monolingual counterpart in terms of translation quality? 


In Section 5.2, we evaluate our approach on a Chinese-English translation task. By using 
the CDCM model, our approach achieves a significant improvement in BLEU score of up to 1.4 
points. Moreover, the CDCM model significantly outperforms its context-independent counterpart, 
confirming our hypothesis that local contexts are very useful for machine translation. 

In Section 5.3, we compare model initializations by bilingual word embeddings and by 
conventional monolingual word embeddings. Experimental results show that initialization 
by bilingual word embeddings consistently outperforms its monolingual counterpart, indicating 
that bilingual word embeddings give a better initialization of the CDCM model. 


5.1 Setup 


We carry out our experiments on the NIST Chinese-English translation tasks. Our training 
data contains 1.5M sentence pairs from LDC datasets. The corpus includes 
LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 
and LDC2005T06. We train a 4-gram language model on the Xinhua portion of the GIGAWORD 
corpus using the SRI Language Toolkit (Stolcke 2002). We use the 2002 NIST MT evaluation test 
data as the development data, and the 2004 and 2005 NIST MT evaluation test data as the test data. 
We use minimum error rate training (Och 2003) to optimize the feature weights. For evaluation, 
case-insensitive NIST BLEU (Papineni et al. 2002) is used to measure translation performance. 

For training the neural networks, we use 4 convolution layers for source sentences and 3 
convolution layers for target phrases. For both of them, 4 pooling layers (pooling size is 2) are 
used, and all layers use 100 feature maps. We set the sliding window size $k = 3$ and the learning rate 
$\eta = 0.02$. All the parameters are selected based on the development data. To produce high-quality 
bilingual phrase pairs to train the CDCM model, we perform forced decoding on the bilingual 
training sentences and collect the phrase pairs used. We obtain 2.4M unique phrase pairs (length 
ranging from 1 to 7) and 20.2M phrase pairs in different contexts. Since curriculum training 
of the CDCM model requires each source phrase to have at least two corresponding 
target phrases, we obtain 13.5M phrase pairs after removing those that do not meet this requirement. 
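
The final filtering step, keeping only source phrases with at least two distinct target phrases, can be sketched as follows. The function name and the toy pairs are hypothetical; real input would be the phrase pairs collected by forced decoding.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def keep_ambiguous_sources(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only phrase pairs whose source phrase has at least two distinct targets."""
    pairs = list(pairs)
    targets: Dict[str, Set[str]] = defaultdict(set)
    for src, tgt in pairs:
        targets[src].add(tgt)
    return [(src, tgt) for src, tgt in pairs if len(targets[src]) >= 2]


# Toy phrase pairs in context: "s1" has two different translations, "s2" only one.
pairs = [("s1", "a"), ("s1", "b"), ("s2", "c"), ("s1", "a")]
kept = keep_ambiguous_sources(pairs)
```

Occurrences are kept per context, so a pair that appears several times stays in the output as long as its source phrase has at least one alternative translation.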


5.2 Evaluation of Translation Quality 

We have two baseline systems: 


• Baseline: The baseline system is Moses (Koehn et al. 2007), an open-source phrase-based 
system, with a set of common features, including 
translation models, word and phrase penalties, a linear distortion model, a 
lexicalized reordering model, and a language model. 



Models      MT04       MT05       All 
Baseline    34.86      33.18      34.40 
CICM        35.82^α    33.51^α    34.95^α 
CDCM_1      35.87^α    33.58      35.01^α 
CDCM_2      35.97^α    33.80^α    35.21^α 
CDCM_3      36.26^αβ   33.94^αβ   35.40^αβ 

Table 1 

Evaluation of translation quality. CDCM_k denotes the CDCM model trained in the kth curriculum, CICM 
denotes its context-independent counterpart, and "All" is the combined test sets. The superscripts α and β 
indicate statistically significant difference (p < 0.05) from Baseline and CICM, respectively. 


Example 1 
sentence      [Chinese source sentences, one per context] 
references    context 1: incorrect, faulty, wrong, erroneous 
              context 2: a mistake, mistakes 
TM            wrong (1143), mistakes (361), mistake (314) 
CICM          was wrong (7), is wrong (8), the wrong (134) 
CDCM_1        context 1: a wrong (44), wrong (1143), its mistake (12) 
              context 2: a mistake (16), by mistake (5), the mistake (30) 
CDCM_2        context 1: wrong (1143), the mistake (30), a wrong (44) 
              context 2: the erroneous (31), a mistake (16), the mistake (30) 
CDCM_3        context 1: false (42), wrong (1143), faulty (16) 
              context 2: mistake (314), error (162), fault (14) 

Example 2 
sentence      [Chinese source sentences, one per context] 
references    context 1: the key point is 
              context 2: focus is 
TM            focus is (10), focus on (8), focuses on (6) 
CICM          the key point is (3), key point is (3), where the focus is (2) 
CDCM_1        context 1: the focus is (4), focus was (2), where the focus is (2) 
              context 2: focus was (2), the focus is (4), focuses on a (2) 
CDCM_2        context 1: where the focus is (2), is mainly (2), priority is (2) 
              context 2: focus was (2), focus is (10), priority is (2) 
CDCM_3        context 1: the key point is (3), the focus is (4), main point is that (2) 
              context 2: focus is (10), priority is (2), focus of (2) 

Figure 3 

The top ranked target phrases according to the translation model (TM) and the CDCM model. The number 
in the bracket is the co-occurrence count of the source-target phrase pair in the corpus. 


• CICM (context-independent convolutional matching) model: Following 
previous work (Gao et al. 2014; Zhang et al. 2014; Cho et al. 2014b), we 
calculate the matching degree of a phrase pair without considering any contextual 
information. Each unique phrase pair serves as a positive example, and a randomly 
selected target phrase from the phrase table is the corresponding negative example. 
The matching score is also introduced into Baseline as an additional feature. 


Table 1 summarizes the results of the CDCM models trained with different curriculums. Regardless of 
the curriculum in which it is trained, the CDCM model significantly improves the translation quality 
on the overall test data (with gains of up to 1.0 BLEU points). The best improvement is up to 1.4 
BLEU points on MT04 with the fully trained CDCM. As expected, translation performance 
increases consistently as the curriculum advances. This indicates that the CDCM model indeed 
captures the desirable semantic information through curriculum learning from easy to difficult. 

            Monolingual                 Bilingual 
Models      MT04     MT05     All       MT04     MT05     All 
CDCM_1      35.74    33.38    34.85     35.87    33.58    35.01 
CDCM_2      35.80    33.59    35.04     35.97    33.80    35.21 
CDCM_3      35.95    33.65    35.14     36.26    33.94    35.40 

Table 2 

Comparison of the monolingual word embeddings (Monolingual) and the bilingual word embeddings 
(Bilingual) in terms of translation quality. 



sentence      [Chinese source sentence] 
references    the key point is 
TM            focus is (10), focus on (8), focuses on (6) 

              Monolingual Word Embedding 
CDCM_1        focus (3), focus is (10), focus was (2) 
CDCM_2        is mainly (2), emphasis is (4), important thing is (2) 
CDCM_3        main point is that the (2), main point is that (2), main point is (2) 

              Bilingual Word Embedding 
CDCM_1        the focus is (4), focus was (2), where the focus is (2) 
CDCM_2        where the focus is (2), is mainly (2), priority is (2) 
CDCM_3        the key point is (3), the focus is (4), main point is that (2) 

Figure 4 

The top ranked target phrases according to the CDCM models with different initializations. 


Compared with its context-independent counterpart (CICM, Row 2), the CDCM model 
shows consistent, significant improvement on all the test data. We attribute this to the 
incorporation of useful discriminative information embedded in the local context. In addition, the 
performance of CICM is comparable with that of CDCM_1. This is intuitive, because both of them 
try to capture the basic semantic similarity between the source and target phrase pair. 

Qualitative Analysis. Figure 3 lists some interesting cases to show why the CDCM model 
improves performance. We compare the phrase pair scores computed by the CDCM model 
with the phrase translation probabilities from the translation model. First, the CDCM model 
scores phrase pairs based on semantic similarity and contextual information rather than 
on their co-occurrences in the corpus. Therefore, it is complementary to the translation model. 
Second, as the curriculum advances, our model becomes more likely to capture the context-dependent 
semantic similarities between phrase pairs. In most cases, the translation 
candidates chosen by the fully trained CDCM model (i.e. CDCM_3) are closer to the actual translations for 
both frequent and less frequent phrases. Third, although the CICM model captures the semantic 
similarities between phrase pairs, it fails to adapt to different local contexts. In contrast, 
the CDCM model is able to provide different translation candidates based on the discriminative 
information embedded in the local contexts. 


5.3 Evaluation of Bilingual Word Embeddings 


In this section, we investigate the influence of the bilingual word embeddings used to 
initialize the CDCM model. We use Word2Vec (Mikolov et al. 2013) to train the monolingual 
word embeddings. We train the bilingual word embeddings using the approach described in 
Section 4.1. Both the bilingual and monolingual embeddings have 50 dimensions. 

Table 2 shows the comparative results between bilingual and monolingual word embeddings. 
As seen, our bilingual word embedding model outperforms its monolingual counterpart 
consistently. Zou et al. (2013) and Wu et al. (2014) reported that word-level semantic relationships 
across languages, captured by the bilingual word embeddings, boost machine translation 
performance. Our results reconfirm these findings. 

Qualitative Analysis. Figure 4 lists some cases to show why the context-dependent bilingual 
word embeddings produce consistent improvements. As seen, the CDCM model initialized with 
bilingual word embeddings produces more discriminative results than its monolingual counterpart. 
Taking CDCM_3 as an example, the monolingual word embedding scenario prefers 
candidates that contain "main point is", while its bilingual counterpart selects different candidates 
that share the same semantic meaning. One possible reason is that bilingual and contextual 
information helps to capture the semantic relationships between words across languages (Yang 
et al. 2013), and thus yields better phrasal similarities through the principle of compositionality. 


5.4 Discussion 


Convolutional Model vs. Recursive Model. Previous work on bilingual phrase representations 
usually employs a Recurrent Neural Network (RNN) (Cho et al. 2014b) or a Recursive AutoEncoder 
(RAE) (Zhang et al. 2014). It has been observed (Kalchbrenner and Blunsom 2013; Sutskever 
et al. 2014; Cho et al. 2014a) that the recursive approaches suffer from a significant drop in 
translation quality when translating long sentences. In contrast, Kalchbrenner et al. (2014) show 
that the convolutional model can represent the semantic content of a long sentence accurately. 
Therefore, we choose the convolutional architecture to model the meaning of a sentence. 

Limitations. Unlike recursive models, the convolutional architecture has a fixed depth, which 
bounds the level of composition. In this task, this limitation can be largely compensated for by a 
subsequent network that takes a "global" synthesis of the learned sentence representation. 

One of the hypotheses we tested in the course of this research was disproved. We thought it 
likely that the difficult curriculum (i.e. distinguishing the correct translation from other candidates 
for a given context) would contribute most to the improvement, since this setting is more 
consistent with the real decoding procedure. This turned out to be false, as shown in Table 1. One 
possible reason is that the "negative" examples (other candidates for the same source phrase) may 
share the same semantic meaning as the positive one, and thus misguide the supervised 
training. Constructing a reasonable set of negative examples that are more semantically different 
from the positive one is left for future work. 


6. Conclusion 

In this paper, we propose a context-dependent convolutional matching model to capture semantic 
similarities between phrase pairs that are sensitive to contexts. Experimental results show that 
our approach significantly improves translation performance, obtaining an improvement of 1.0 
BLEU points on the overall test data. 

Integrating deep architectures into context-dependent translation selection is a promising way 
to improve machine translation. This paper is the first step in what we hope will be a long and 
fruitful journey. In the future, we will try to exploit contextual information on the target side (e.g., 
partial translations). 